The objective of this project is to predict Airbnb rental prices based on 90 features. These features span various data types, so data cleaning and pre-processing are required before modeling. The result will be evaluated with RMSE; the lower the score, the better the model predicts.
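For reference, the evaluation metric can be computed directly; a minimal sketch (the helper name `rmse` is mine, not part of the project):

```r
# Root mean squared error: square root of the mean squared residual.
# Lower values mean predictions sit closer to the true prices.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
rmse(c(100, 200, 300), c(110, 190, 310))  # 10
```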
The following code initializes the R environment with the required packages and helper functions. Since the project will build an artificial neural network, TensorFlow and Keras are imported via Miniconda, which facilitates any GPU computation the network needs. The code also sets constant values for the entire project, including the random seed and the number of cores.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 458072 24.5 989307 52.9 638940 34.2
## Vcells 833298 6.4 8388608 64.0 1635956 12.5
## Command for sourcing the URL:
## downloader::source_url("https://raw.githubusercontent.com/DMinghao/Analysis_Pocketknife/main/R/init_env.R", sha="9e657bc82974025a094b66de35cb78eb826621ab")
## Hash 9e657bc82974025a094b66de35cb78eb826621ab matches expected value.
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
## The following object is masked from 'package:plotly':
##
## slice
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:tensorflow':
##
## train
## The following object is masked from 'package:purrr':
##
## lift
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
##
## Attaching package: 'sylly'
## The following object is masked from 'package:psych':
##
## describe
## For information on available language packages for 'koRpus', run
##
## available.koRpus.lang()
##
## and see ?install.koRpus.lang()
##
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
##
## tokenize
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
##
## Attaching package: 'tm'
## The following object is masked from 'package:koRpus':
##
## readTagged
## Loading required package: ParamHelpers
## Loading required package: checkmate
## Loading required package: mlr
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
##
## Attaching package: 'mlr'
## The following object is masked from 'package:caret':
##
## train
## The following object is masked from 'package:tensorflow':
##
## train
## The following object is masked from 'package:e1071':
##
## impute
##
## Attaching package: 'DiceKriging'
## The following object is masked from 'package:checkmate':
##
## checkNames
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:fpc':
##
## dbscan
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## [1] "Miniconda is already installed. "
## [1] "r_tf_gpu is already installed. "
The provided data is split into two files: a training set, which contains a price column, and a scoring set, which does not.
writeSubmit <- function(pred) {
submissionFile = data.frame(id = scoringData$id, price = pred)
write.csv(submissionFile, 'submission.csv', row.names = F)
}
rawData <- read_csv('./input/rentlala2021/analysisData.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## name = col_character(),
## summary = col_character(),
## space = col_character(),
## description = col_character(),
## neighborhood_overview = col_character(),
## notes = col_character(),
## transit = col_character(),
## access = col_character(),
## interaction = col_character(),
## house_rules = col_character(),
## host_name = col_character(),
## host_since = col_date(format = ""),
## host_location = col_character(),
## host_about = col_character(),
## host_response_time = col_character(),
## host_response_rate = col_character(),
## host_acceptance_rate = col_character(),
## host_is_superhost = col_logical(),
## host_neighbourhood = col_character(),
## host_verifications = col_character()
## # ... with 29 more columns
## )
## i Use `spec()` for the full column specifications.
scoringData <- read_csv('./input/rentlala2021/scoringData.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## name = col_character(),
## summary = col_character(),
## space = col_character(),
## description = col_character(),
## neighborhood_overview = col_character(),
## notes = col_character(),
## transit = col_character(),
## access = col_character(),
## interaction = col_character(),
## house_rules = col_character(),
## host_name = col_character(),
## host_since = col_date(format = ""),
## host_location = col_character(),
## host_about = col_character(),
## host_response_time = col_character(),
## host_response_rate = col_character(),
## host_acceptance_rate = col_character(),
## host_is_superhost = col_logical(),
## host_neighbourhood = col_character(),
## host_verifications = col_character()
## # ... with 30 more columns
## )
## i Use `spec()` for the full column specifications.
A quick column comparison shows that the two data sets contain columns that do not match each other. Thus, some preliminary housekeeping is needed to keep the columns of the two data sets in sync.
compare_df_cols(rawData, scoringData) %>%
filter(rawData != scoringData |
rawData %>% is.na() | scoringData %>% is.na())
## column_name rawData scoringData
## 1 price numeric <NA>
## 2 zipcode numeric character
rawData %>% mutate(license = license %>% as.logical()) -> rawData
scoringData %>% mutate(zipcode = zipcode %>% as.numeric()) -> scoringData
## Warning in zipcode %>% as.numeric(): NAs introduced by coercion
compare_df_cols(rawData, scoringData) %>%
filter(rawData != scoringData | rawData %>% is.na() | scoringData %>% is.na())
## column_name rawData scoringData
## 1 price numeric <NA>
The following code creates a price column for the scoring data so that the two data sets can be bound together into one for the upcoming pre-processing and cleaning.
pricedScoringData <- scoringData %>% mutate(price = -1)
allData <- rawData %>% bind_rows(pricedScoringData)
Because most models cannot handle missing data, character data, or unscaled data, the data must be cleaned and preprocessed before moving forward. Beyond that, some feature engineering may help by giving models more information to work with. However, since most models tend to run longer and perform worse as more features are added, feature selection will be needed to improve model performance and/or accuracy.
Given the different data types in the data set, bulk-processing columns of the same type can speed things up. The following code groups columns by data type to help with the later cleaning steps.
# Extract ID column
extractID <- allData %>% select(id)
allData <- allData %>% select(-id)
# Put similar columns into baskets
numericCols <- allData %>% select(is.numeric) %>% colnames()
boolCols <- allData %>% select(is.logical) %>% colnames()
dateCols <- allData %>% select(is.Date) %>% colnames()
charCols <- allData %>% select(is.character) %>% colnames()
longCharCols <-
allData[charCols] %>%
select(c(name,
summary,
space,
description,
neighborhood_overview,
notes,
transit,
access,
interaction,
house_rules,
host_about
)) %>%
colnames()
factorCharCols <-
allData[charCols] %>%
select(-c(name,
summary,
space,
description,
neighborhood_overview,
notes,
transit,
access,
interaction,
house_rules,
host_about
)) %>%
select(-c(
host_name,
host_verifications,
host_response_time,
calendar_updated,
host_response_rate,
host_acceptance_rate,
amenities
)) %>%
colnames()
rateCols <- allData %>%
select(c(host_response_rate, host_acceptance_rate)) %>%
colnames()
# Encode all text columns to UTF-8
allData[charCols] <- allData[charCols] %>% mutate_all(funs(enc2utf8(.)))
Having the data in various formats and types prevents us from extracting information from it. Hence, the following section transforms some columns so that they are easier to work with.
Since some zip codes are missing, we can fill those points in with the most common zip code of the corresponding neighbourhood.
mostCommonZip <- allData %>%
select(c(neighbourhood_cleansed, neighbourhood_group_cleansed, city, zipcode)) %>%
group_by(neighbourhood_cleansed, zipcode) %>%
summarise(count = n()) %>%
filter(count == max(count)) %>%
ungroup()
## `summarise()` has grouped output by 'neighbourhood_cleansed'. You can override using the `.groups` argument.
# Note: the argument must not shadow the column name, otherwise
# filter(neighbourhood_cleansed == neighbourhood_cleansed) compares the
# column to itself and keeps every row.
getZip <- function(nbhd){
mostCommonZip %>% filter(neighbourhood_cleansed == nbhd) %>% pull(zipcode) %>% first()
}
allData <- allData %>%
mutate(zipcode = ifelse(is.na(zipcode),
purrr::map_dbl(neighbourhood_cleansed, getZip),
zipcode)) %>%
mutate(zipcode = zipcode %>% as.factor())
Some text columns are in percentage format. We can assume that all NA values in these columns are 0, since, in a real-world scenario, it's reasonable that these NAs come from dividing by 0. The remaining data points can be parsed to numeric.
allData[rateCols] <- allData[rateCols] %>%
mutate(host_response_rate = gsub("N/A","0%", host_response_rate)) %>%
mutate(host_acceptance_rate = gsub("N/A","0%", host_acceptance_rate)) %>%
mutate(host_response_rate = gsub("%","", host_response_rate)) %>%
mutate(host_acceptance_rate = gsub("%","", host_acceptance_rate)) %>%
mutate_all(as.numeric) %>%
mutate_each(funs(./100)) %>%
replace(is.na(.),0)
## Warning: `mutate_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.
The following code handles the NA values in the square_feet, weekly_price, and monthly_price columns. Setting these values to 0 might not be the best approach, but it is safe. One alternative would be MICE imputation, but since the number of missing values is so large, it would distort the distribution of these columns. Another option would be to train models that predict these values.
allData[boolCols] <- allData[boolCols] %>% replace(is.na(.),F)
reservedNumCols <- allData %>% select(c(square_feet, weekly_price, monthly_price)) %>% colnames()
allData[numericCols[numericCols %!in% reservedNumCols]] <-
allData[numericCols[numericCols %!in% reservedNumCols]] %>% replace(is.na(.),0)
allData <- allData %>% mutate(square_feet = ifelse(is.na(square_feet), 0, square_feet))
allData <- allData %>% mutate(weekly_price = ifelse(is.na(weekly_price), 0, weekly_price))
allData <- allData %>% mutate(monthly_price = ifelse(is.na(monthly_price), 0, monthly_price))
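For reference, the MICE alternative mentioned above might look like the following sketch; the single imputation (`m = 1`) and the predictive-mean-matching method are my assumptions, not what the project ran:

```r
library(mice)
# Impute square_feet, weekly_price, and monthly_price with predictive
# mean matching instead of zero-filling; one imputation for illustration.
imp <- mice(allData %>% select(square_feet, weekly_price, monthly_price),
            m = 1, method = "pmm", seed = seed, printFlag = FALSE)
imputedCols <- complete(imp)
# imputedCols now holds the three columns with NAs filled in; they could
# be written back into allData if this route were chosen.
```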
Several text columns can be converted to factors. Some have only one level, so they should be dropped. On the other hand, I decided to keep the columns with a large number of levels as well, though it would be wise to downsize those factors by grouping low-frequency levels into an "other" category.
allData[factorCharCols] <- allData[factorCharCols] %>%
replace(is.na(.), "N/A") %>%
mutate_all(as.factor)
# Exclude 1 level factor columns
allData <- allData %>% select(-c(country_code, country, state, market))
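One way to do the low-frequency grouping mentioned above, which the pipeline itself does not apply, is forcats::fct_lump_min; the threshold of 50 and the choice of host_neighbourhood as the example column are my assumptions:

```r
library(forcats)
# Collapse factor levels that occur fewer than 50 times into "Other",
# shown here on one high-cardinality column as an example.
allData <- allData %>%
  mutate(host_neighbourhood = fct_lump_min(host_neighbourhood,
                                           min = 50,
                                           other_level = "Other"))
```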
The host verification and amenities columns are transformed into one-hot dummy columns, one per unique item, which lets models process the information in these columns more easily. The original columns are dropped after processing.
# Create verification count column
allData <- allData %>%
mutate(host_verifications = gsub("\\[|\\]|\\'|\\,", "", host_verifications)) %>%
mutate(vari_count = strsplit(host_verifications, " ") %>% lengths())
vari_list <- allData %>%
select(host_verifications) %>%
lapply(function(x) unique(trimws(unlist(strsplit(x, " ")))))
for(w in 1:length(vari_list[[1]])) {
new <- grepl(pattern = vari_list[[1]][w], x = allData$host_verifications, fixed = TRUE)
allData[paste(vari_list[[1]][w], "_vari")] <- new
}
# Create amenities count column
allData <- allData %>%
mutate(amenities_count = strsplit(amenities, ",") %>% lengths())
amen_list <- allData %>%
select(amenities) %>%
mutate(amenities = gsub("\\.", "", amenities)) %>%
lapply(function(x) unique(trimws(unlist(strsplit(x, ",")))))
for(w in 1:length(amen_list[[1]])) {
new <- grepl(pattern = amen_list[[1]][w], x = allData$amenities, fixed = TRUE)
allData[paste(amen_list[[1]][w], "_amen")] <- new
}
# discard original column
allData <- allData %>% select(-c(amenities, host_verifications)) %>% clean_names()
The duration columns are messy: they are in text format and do not have a consistent step size. Therefore, after stripping the text components from these columns, a column-specific step size is chosen to rescale each duration.
allData <- allData %>%
mutate(host_response_time = gsub("within a ", "", host_response_time)) %>%
mutate(host_response_time = gsub("within an ", "", host_response_time)) %>%
mutate(host_response_time = gsub("few hours", "12", host_response_time)) %>%
mutate(host_response_time = gsub("hour", "1", host_response_time)) %>%
mutate(host_response_time = gsub("a few days or more", "48", host_response_time)) %>%
mutate(host_response_time = gsub("day", "24", host_response_time)) %>%
mutate(host_response_time = replace_na(host_response_time, "N/A")) %>%
mutate(host_response_time = gsub("N/A", "96", host_response_time)) %>%
mutate(host_response_time = as.numeric(host_response_time))
allData <- allData %>%
mutate(calendar_updated = gsub(" ago", "", calendar_updated)) %>%
mutate(calendar_updated = gsub("today", "0", calendar_updated)) %>%
mutate(calendar_updated = gsub("yesterday", "1", calendar_updated)) %>%
mutate(calendar_updated = case_when(
grepl("days", calendar_updated) ~ as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated)) %>% as.character(),
grepl("weeks", calendar_updated) ~ as.character(as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated))*7),
grepl("months", calendar_updated) ~ as.character(as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated))*30),
grepl("a week", calendar_updated) ~ "7",
grepl("week", calendar_updated) ~ "7",
grepl("never", calendar_updated) ~ "3000",
TRUE ~ as.character(calendar_updated)
)) %>%
mutate(calendar_updated = as.numeric(calendar_updated))
## Warning in as.numeric(gsub("([0-9]+).*$", "\\1", calendar_updated)) %>% : NAs
## introduced by coercion
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion
Some models cannot process date columns, so we transform them to numeric. Using the latest date as 0, we compute every other date's "distance" to that most recent date, giving a consistent numeric representation.
allData[dateCols] <- allData %>%
select(dateCols) %>%
mutate_all(funs(max(., na.rm = TRUE) - .)) %>%
mutate_all(as.numeric)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(dateCols)` instead of `dateCols` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
Having 90 columns is good, but why not more? Some models benefit from having more features, especially meaningful features and features that were not derived from other numeric features.
If I were a user on the platform needing to price my property, it would be helpful to know the average price in my area. The same applies to the artificial stupidity: providing an average price for each area is like helping it cheat on a test, and the result might be phenomenal. The following code adds three new columns to the data set: the mean price per neighbourhood, per neighbourhood group, and per zip code.
mean_price <- allData %>%
filter(price > -1) %>%
group_by(neighbourhood_cleansed = neighbourhood_cleansed) %>%
summarize(record_count_c = n(),
price_mean_c = mean(price))
allData <- allData %>% left_join(mean_price, by = c("neighbourhood_cleansed" = "neighbourhood_cleansed"))
# Note: subsetting must use allData$price > -1; the string comparison
# "price" > -1 is always TRUE and would select every row.
allData[is.na(allData$price_mean_c),]$price_mean_c = mean(allData[allData$price > -1, ]$price_mean_c, na.rm = TRUE)
mean_price2 <- allData %>%
filter(price > -1) %>%
group_by(neighbourhood_group_cleansed = neighbourhood_group_cleansed) %>%
summarize(price_mean_ngc = mean(price))
allData <- allData %>% left_join(mean_price2, by = c("neighbourhood_group_cleansed" = "neighbourhood_group_cleansed"))
mean_price3 <- allData %>%
filter(price > -1) %>%
group_by(zipcode = zipcode) %>%
summarize(price_mean_zip = mean(price))
allData <- allData %>% left_join(mean_price3, by = c("zipcode" = "zipcode"))
allData[is.na(allData$price_mean_zip),]$price_mean_zip = mean(allData[allData$price > -1, ]$price_mean_zip, na.rm = TRUE)
This new feature might be excessive, but I just can't help myself. Knowing a host's name gives us the opportunity to guess their gender. It might not be something that influences the price, but who knows, maybe males tend to price their properties higher than females do.
allData %>% select(host_name) %>% c() -> names
allData <- allData %>%
left_join(gender(names$host_name) %>%
distinct() %>%
select(c(name, proportion_male, proportion_female)),
by = c("host_name" = "name"))
The zip code column by itself might not be helpful, since the codes are just numbers without meaning. Therefore, adding each zip code's coordinates might be useful. In a real-world scenario, it's safe to assume that properties in some areas are priced higher than in others, but it's up to the model to find out.
ZipCodes <- read.table(unz("US.zip","US.txt"), sep="\t")
names(ZipCodes) = c("CountryCode", "zip", "PlaceName",
"AdminName1", "AdminCode1", "AdminName2", "AdminCode2",
"AdminName3", "AdminCode3", "latitude", "longitude", "accuracy")
ZipCodes <- ZipCodes %>% mutate(zip = as.factor(zip))
allData <- allData %>%
left_join(
ZipCodes %>%
select(c(zip,
PlaceName,
AdminName2,
latitude,
longitude)), by = c("zipcode"="zip")
)
allData <- allData %>% mutate(PlaceName = PlaceName %>% as.factor(),
AdminName2 = AdminName2 %>% as.factor())
The fun part is that all these "useless" long text columns can also be transformed into numeric columns. The textfeatures function generates columns such as word count, URL count, exclamation mark count, etc. Even better, it also performs sentiment analysis and LDA topic clustering on these text columns. Are these columns useful? It's up to the model to find out.
allData[longCharCols] <- allData[longCharCols] %>% replace(is.na(.), "")
allText <- NULL
for(col in longCharCols) allText <- paste(allText, " ", allData[[col]])
tex_feat <- textfeatures(allText)
## ↪ Counting features in text...
## ↪ Sentiment analysis...
## ↪ Parts of speech...
## ↪ Word dimensions started
## ↪ Normalizing data
## ✔ Job's done!
allData <- allData %>% bind_cols(tex_feat)
Assuming that not all features have a linear relationship with the price, raising every numeric feature to the 2nd and 3rd power might help the model fit better.
copyData <- allData
allData <- allData %>% cbind( copyData %>%
select(where(is.numeric)) %>%
select(-price) %>%
mutate_all(function(x) x^2) %>%
setNames(paste0(names(.), "_2pow")))
allData <- allData %>% cbind( copyData %>%
select(where(is.numeric)) %>%
select(-price) %>%
mutate_all(function(x) x^3) %>%
setNames(paste0(names(.), "_3pow")))
Scaling is not necessary for some models, and not all models benefit from scaled data, but here we are, why not.
numCols <- allData %>% select(is.numeric) %>% colnames()
facCols <- allData %>% select(is.factor) %>% colnames()
boolCols <- allData %>% select(is.logical) %>% colnames()
allData[numCols] <- allData[numCols] %>%
replace(is.na(.),0) %>%
mutate_at(vars(-price), funs(scale))
allData[facCols] <- allData[facCols] %>%
mutate_all(funs(as.numeric(.)-1))
allData[boolCols] <- allData[boolCols] %>%
mutate_all(funs(as.numeric(.)))
Zero-variance features carry no information, so it's wise to drop them.
remove0var <- function(dat) {
out <- lapply(dat, function(x) length(unique(x)))
want <- which(!out > 1)
name <- unlist(want) %>% names()
print(name)
dat %>% select(-all_of(name))
}
allData <- allData %>% remove0var()
## [1] "has_availability"
## [2] "requires_license"
## [3] "license"
## [4] "jurisdiction_names"
## [5] "is_business_travel_ready"
## [6] "translation_missing_enhosting_amenity_50_amen"
## [7] "translation_missing_enhosting_amenity_49_amen"
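The already-attached caret package offers a similar check; a sketch using nearZeroVar, here flagging only the truly constant columns (the same ones remove0var drops):

```r
library(caret)
# nearZeroVar profiles every column; zeroVar == TRUE marks columns with
# a single unique value, i.e. the zero-variance features.
nzv <- nearZeroVar(allData, saveMetrics = TRUE)
rownames(nzv)[nzv$zeroVar]  # names of the constant columns
```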
Since rent can be classified into high, medium, and low price ranges, would it help if the model also knew which range a property falls into? With this in mind, clustering might segregate the data into different price ranges. The following code generates 15 clusters for the data set. Due to the time limit, I used the simplest clustering method, k-means, and hand-picked the number of clusters. However, HDBSCAN would be a better choice of clustering method in this case (too bad it runs too long and does not support parallel processing).
beforeCluster <- allData
allData <- beforeCluster %>% select(!is.character)
wss <- (nrow(allData)-1)*sum(apply(allData,2,var))
for (i in 2:20) { # increase for better result
set.seed(seed)
# print(i)
clu <- kmeans(allData %>% select(-price), centers=i)
wss[i] <- sum(clu$withinss)
}
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
set.seed(seed)
kc <- kmeans(allData, centers=15)
beforeCluster %>% mutate(cluster = kc$cluster %>% as.factor()) -> afterCluster
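For reference, an HDBSCAN run with the already-attached dbscan package might look like the following; minPts = 25 is an untuned guess, not a value the project validated:

```r
library(dbscan)
set.seed(seed)
# HDBSCAN chooses the number of density-based clusters itself and labels
# sparse points as noise (cluster 0), so no k has to be hand-picked.
hdb <- hdbscan(allData %>% select(-price) %>% as.matrix(), minPts = 25)
# table(hdb$cluster)  # cluster sizes; 0 = noise points
```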
Like the mean price per area, having the mean price for each cluster might also help the model find a more appropriate price.
mean_price_cluster <- afterCluster %>%
filter(price > -1) %>%
group_by(cluster = cluster) %>%
summarize(price_mean_clus = mean(price))
afterCluster <- afterCluster %>% left_join(mean_price_cluster, by = c("cluster" = "cluster"))
# plotting each cluster's price distribution
pp <- afterCluster %>% select(c(price, cluster)) %>% ggplot(aes(x = price, color = cluster)) + geom_density()
pp %>% ggplotly()
allData <- afterCluster %>% mutate(cluster = cluster %>% as.numeric())
After the data has been cleaned, pre-processed, and transformed, we need to prepare it for modeling.
The training set and testing set are separated.
trainData <- allData %>%
select(!is.character) %>%
filter(price > -1)
testData <- allData %>%
select(!is.character) %>%
filter(price == -1) %>% select(-price)
To lighten the load on some models, we can perform feature selection on our data set. Some models may even perform better with fewer features. The following code uses the Boruta algorithm, a tree-based wrapper method, to calculate feature importance.
boruta_output <-
Boruta(
price ~ .,
data = trainData,
pValue = 0.05,
maxRuns = 500,
doTrace = 2,
getImp = getImpXgboost, # delete this line for better result, keep it for faster output
nthread=cores,
)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## 11. run of importance source...
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## 15. run of importance source...
## After 15 iterations, +3.2 mins:
## confirmed 14 attributes: accommodates, bathrooms, bedrooms, cleaning_fee, cluster and 9 more;
## rejected 1024 attributes: accessible_height_bed_amen, accessible_height_toilet_amen, accommodates_2pow, accommodates_3pow, AdminName2 and 1019 more;
## still have 26 attributes left.
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## After 19 iterations, +3.2 mins:
## confirmed 1 attribute: availability_365;
## rejected 4 attributes: price_mean_ngc, w125, w139_2pow, w56;
## still have 21 attributes left.
## 20. run of importance source...
## 21. run of importance source...
## 22. run of importance source...
## 23. run of importance source...
## After 23 iterations, +3.3 mins:
## rejected 1 attribute: zipcode;
## still have 20 attributes left.
## 24. run of importance source...
## 25. run of importance source...
## 26. run of importance source...
## After 26 iterations, +3.3 mins:
## rejected 1 attribute: n_nonasciis;
## still have 19 attributes left.
## 27. run of importance source...
## 28. run of importance source...
## 29. run of importance source...
## 30. run of importance source...
## After 30 iterations, +3.3 mins:
## confirmed 1 attribute: w50;
## still have 18 attributes left.
## 31. run of importance source...
## 32. run of importance source...
## 33. run of importance source...
## 34. run of importance source...
## 35. run of importance source...
## 36. run of importance source...
## 37. run of importance source...
## 38. run of importance source...
## 39. run of importance source...
## After 39 iterations, +3.4 mins:
## confirmed 1 attribute: minimum_nights;
## rejected 1 attribute: host_since;
## still have 16 attributes left.
## 40. run of importance source...
## 41. run of importance source...
## 42. run of importance source...
## After 42 iterations, +3.4 mins:
## rejected 1 attribute: calculated_host_listings_count_entire_homes;
## still have 15 attributes left.
## 43. run of importance source...
## 44. run of importance source...
## 45. run of importance source...
## 46. run of importance source...
## 47. run of importance source...
## 48. run of importance source...
## 49. run of importance source...
## 50. run of importance source...
## 51. run of importance source...
## 52. run of importance source...
## 53. run of importance source...
## 54. run of importance source...
## 55. run of importance source...
## 56. run of importance source...
## 57. run of importance source...
## 58. run of importance source...
## 59. run of importance source...
## 60. run of importance source...
## 61. run of importance source...
## 62. run of importance source...
## 63. run of importance source...
## 64. run of importance source...
## 65. run of importance source...
## 66. run of importance source...
## 67. run of importance source...
## 68. run of importance source...
## 69. run of importance source...
## After 69 iterations, +3.6 mins:
## confirmed 1 attribute: minimum_nights_avg_ntm;
## still have 14 attributes left.
## 70. run of importance source...
## 71. run of importance source...
## 72. run of importance source...
## 73. run of importance source...
## 74. run of importance source...
## 75. run of importance source...
## After 75 iterations, +3.6 mins:
## confirmed 1 attribute: availability_60;
## still have 13 attributes left.
## 76. run of importance source...
## 77. run of importance source...
## 78. run of importance source...
## 79. run of importance source...
## 80. run of importance source...
## 81. run of importance source...
## 82. run of importance source...
## 83. run of importance source...
## 84. run of importance source...
## 85. run of importance source...
## 86. run of importance source...
## 87. run of importance source...
## After 87 iterations, +3.7 mins:
## confirmed 1 attribute: calculated_host_listings_count;
## still have 12 attributes left.
## 88. run of importance source...
## 89. run of importance source...
## 90. run of importance source...
## After 90 iterations, +3.7 mins:
## confirmed 1 attribute: first_review;
## still have 11 attributes left.
## 91. run of importance source...
## 92. run of importance source...
## 93. run of importance source...
## 94. run of importance source...
## 95. run of importance source...
## After 95 iterations, +3.7 mins:
## confirmed 1 attribute: n_commas;
## still have 10 attributes left.
## 96. run of importance source...
## 97. run of importance source...
## 98. run of importance source...
## 99. run of importance source...
## 100. run of importance source...
## After 100 iterations, +3.8 mins:
## confirmed 1 attribute: n_charsperword_2pow;
## still have 9 attributes left.
## 101. run of importance source...
## 102. run of importance source...
## 103. run of importance source...
## 104. run of importance source...
## 105. run of importance source...
## 106. run of importance source...
## 107. run of importance source...
## 108. run of importance source...
## 109. run of importance source...
## 110. run of importance source...
## 111. run of importance source...
## 112. run of importance source...
## 113. run of importance source...
## 114. run of importance source...
## 115. run of importance source...
## 116. run of importance source...
## 117. run of importance source...
## 118. run of importance source...
## 119. run of importance source...
## 120. run of importance source...
## 121. run of importance source...
## 122. run of importance source...
## 123. run of importance source...
## 124. run of importance source...
## 125. run of importance source...
## After 125 iterations, +3.9 mins:
## confirmed 1 attribute: reviews_per_month;
## still have 8 attributes left.
## After 161 iterations, +4.2 mins:
## rejected 1 attribute: elevator_amen;
## still have 7 attributes left.
## After 163 iterations, +4.2 mins:
## rejected 1 attribute: building_staff_amen;
## still have 6 attributes left.
## After 175 iterations, +4.3 mins:
## confirmed 1 attribute: w194;
## still have 5 attributes left.
## After 182 iterations, +4.3 mins:
## confirmed 1 attribute: record_count_c;
## still have 4 attributes left.
## After 272 iterations, +4.8 mins:
## confirmed 1 attribute: security_deposit;
## still have 3 attributes left.
## After 287 iterations, +4.9 mins:
## confirmed 1 attribute: w160;
## still have 2 attributes left.
boruta_dec <- attStats(boruta_output) %>% rownames_to_column()
boruta_dec[boruta_dec$decision!="Rejected",]
## rowname meanImp medianImp minImp
## 2 host_location 0.0033427998 0.0033617345 3.047511e-03
## 7 host_neighbourhood 0.0021625261 0.0021401783 1.557652e-03
## 12 street 0.0021153634 0.0020174007 1.485645e-03
## 21 room_type 0.1959787970 0.1959818572 1.938758e-01
## 22 accommodates 0.0227605405 0.0227393011 2.186984e-02
## 23 bathrooms 0.0216957316 0.0216981411 2.042193e-02
## 24 bedrooms 0.0198406905 0.0197150066 1.900117e-02
## 30 security_deposit 0.0006128238 0.0006154920 2.624091e-04
## 31 cleaning_fee 0.0129010121 0.0129065781 1.188153e-02
## 32 guests_included 0.0034149188 0.0033952619 2.412445e-03
## 34 minimum_nights 0.0007221129 0.0006862727 5.258360e-04
## 40 minimum_nights_avg_ntm 0.0007946778 0.0008194881 0.000000e+00
## 44 availability_60 0.0007950922 0.0006925343 2.701578e-04
## 46 availability_365 0.0017642350 0.0017577582 1.187772e-03
## 49 first_review 0.0006576752 0.0006347074 0.000000e+00
## 62 calculated_host_listings_count 0.0007851487 0.0007956137 9.578249e-05
## 66 reviews_per_month 0.0006977084 0.0006674355 3.775158e-04
## 157 pack_n_play_travel_crib_amen 0.0005587213 0.0004811592 4.767161e-04
## 239 record_count_c 0.0005679868 0.0005416404 4.351942e-04
## 240 price_mean_c 0.0252872969 0.0252895704 2.381056e-02
## 242 price_mean_zip 0.0175870231 0.0175826144 1.690098e-02
## 257 n_commas 0.0007293119 0.0007196047 4.267457e-04
## 332 w50 0.0009930672 0.0009152918 5.420321e-04
## 409 w127 0.0049106311 0.0048791081 4.269551e-03
## 442 w160 0.0005748175 0.0005504995 0.000000e+00
## 476 w194 0.0006005853 0.0005699136 2.803771e-05
## 560 n_charsperword_2pow 0.0007917217 0.0007619573 0.000000e+00
## 735 w163_2pow 0.0005090496 0.0004929428 0.000000e+00
## 1063 cluster 0.0257475562 0.0257783355 2.471778e-02
## 1064 price_mean_clus 0.6268139325 0.6269105073 6.223772e-01
## maxImp normHits decision
## 2 0.0039837437 1.0000000 Confirmed
## 7 0.0032626345 1.0000000 Confirmed
## 12 0.0026547337 1.0000000 Confirmed
## 21 0.1967295665 1.0000000 Confirmed
## 22 0.0231964287 1.0000000 Confirmed
## 23 0.0225098318 1.0000000 Confirmed
## 24 0.0205361218 1.0000000 Confirmed
## 30 0.0011427048 0.6492986 Confirmed
## 31 0.0139397632 1.0000000 Confirmed
## 32 0.0039202326 1.0000000 Confirmed
## 34 0.0014344560 0.8396794 Confirmed
## 40 0.0014033275 0.8016032 Confirmed
## 44 0.0013389270 0.8356713 Confirmed
## 46 0.0024376726 0.9979960 Confirmed
## 49 0.0012600224 0.6933868 Confirmed
## 62 0.0015102141 0.8296593 Confirmed
## 66 0.0014420813 0.7434870 Confirmed
## 157 0.0008670325 0.5711423 Tentative
## 239 0.0009161837 0.6392786 Confirmed
## 240 0.0261493500 1.0000000 Confirmed
## 242 0.0185064882 1.0000000 Confirmed
## 257 0.0011875958 0.7795591 Confirmed
## 332 0.0017583969 0.9498998 Confirmed
## 409 0.0054291993 1.0000000 Confirmed
## 442 0.0012786424 0.6152305 Confirmed
## 476 0.0011192599 0.6452906 Confirmed
## 560 0.0017780480 0.7615230 Confirmed
## 735 0.0011220953 0.4969940 Tentative
## 1063 0.0258560092 1.0000000 Confirmed
## 1064 0.6283600051 1.0000000 Confirmed
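Two of the kept attributes (pack_n_play_travel_crib_amen and w163_2pow) are only Tentative rather than Confirmed; a quick tally of the decisions clarifies what `withTentative = TRUE` will pull in below (a minimal sketch using `boruta_dec` from above):

```r
# Tally Boruta decisions among the non-rejected attributes
table(subset(boruta_dec, decision != "Rejected")$decision)
```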
selectedCols <- boruta_output %>% getSelectedAttributes(withTentative = TRUE)
trainData <- trainData %>% select(price, all_of(selectedCols))
testData <- testData %>% select(all_of(selectedCols))
if (saveProcessedData) {
  write.csv(trainData,
            file("processedTrainData.csv", encoding = "UTF-8"),
            row.names = FALSE)
  write.csv(testData,
            file("processedTestData.csv", encoding = "UTF-8"),
            row.names = FALSE)
}
This is the exciting part: we will fit three models to the training data and see which one performs best.
Linear regression serves as the baseline model; there is really nothing to see here (the result is eye-burning).
linear <- lm(price~., data = trainData)
summary(linear)
##
## Call:
## lm(formula = price ~ ., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -353.55 -27.23 -4.94 20.58 683.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.111e+01 1.238e+00 49.374 < 2e-16 ***
## host_location -2.225e-03 9.244e-04 -2.407 0.016101 *
## host_neighbourhood -1.066e-02 2.123e-03 -5.023 5.10e-07 ***
## street -7.309e-03 3.710e-03 -1.970 0.048859 *
## room_type -1.977e+01 3.344e-01 -59.127 < 2e-16 ***
## accommodates 1.447e+01 4.497e-01 32.189 < 2e-16 ***
## bathrooms 5.382e+00 2.993e-01 17.981 < 2e-16 ***
## bedrooms 1.122e+01 3.760e-01 29.838 < 2e-16 ***
## security_deposit -1.534e-01 2.881e-01 -0.533 0.594346
## cleaning_fee 6.693e+00 3.515e-01 19.040 < 2e-16 ***
## guests_included 3.265e+00 3.357e-01 9.726 < 2e-16 ***
## minimum_nights -5.504e+00 3.725e-01 -14.779 < 2e-16 ***
## minimum_nights_avg_ntm 1.565e+00 3.738e-01 4.186 2.85e-05 ***
## availability_60 4.640e+00 3.686e-01 12.586 < 2e-16 ***
## availability_365 1.367e+00 3.738e-01 3.658 0.000255 ***
## first_review -3.232e-01 2.856e-01 -1.132 0.257805
## calculated_host_listings_count -3.862e+00 3.003e-01 -12.858 < 2e-16 ***
## reviews_per_month -4.473e+00 2.920e-01 -15.321 < 2e-16 ***
## pack_n_play_travel_crib_amen 1.162e+01 1.481e+00 7.850 4.27e-15 ***
## record_count_c 2.217e+00 2.872e-01 7.717 1.21e-14 ***
## price_mean_c 1.076e+01 6.683e-01 16.104 < 2e-16 ***
## price_mean_zip 1.236e+01 6.634e-01 18.625 < 2e-16 ***
## n_commas 3.275e-01 2.772e-01 1.181 0.237501
## w50 -6.934e-02 2.704e-01 -0.256 0.797646
## w127 5.138e+00 2.641e-01 19.451 < 2e-16 ***
## w160 -3.651e-01 2.615e-01 -1.396 0.162650
## w194 8.452e-01 2.675e-01 3.160 0.001578 **
## n_charsperword_2pow 6.999e-02 2.510e-01 0.279 0.780321
## w163_2pow -1.331e-01 2.722e-01 -0.489 0.624822
## cluster 7.191e-01 6.132e-02 11.726 < 2e-16 ***
## price_mean_clus 6.899e-01 3.688e-03 187.040 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.62 on 41299 degrees of freedom
## Multiple R-squared: 0.7671, Adjusted R-squared: 0.7669
## F-statistic: 4533 on 30 and 41299 DF, p-value: < 2.2e-16
pred <- predict(linear, newdata = testData)
# writeSubmit(pred)
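Since the project is scored on RMSE, it is worth sanity-checking the baseline before moving on. A minimal sketch (assuming `linear` and `trainData` from above are in scope; in-sample RMSE is optimistic relative to the held-out score):

```r
# Root mean squared error: the project's evaluation metric
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# In-sample RMSE of the linear baseline; this roughly matches the model's
# residual standard error (~53.6) and understates the true test RMSE
rmse(trainData$price, predict(linear, newdata = trainData))
```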
XGBoost has a great reputation on Kaggle, so we will see how it performs on our data set. Instead of grid search, we use Bayesian optimization to find the best hyperparameter set: it runs faster than grid search and yields better results than randomly sampling hyperparameter values.
# Create training matrix
BoostTrainData <- xgb.DMatrix(model.matrix(price ~ ., data = trainData),
                              label = as.matrix(trainData %>% select(price)))
The Bayesian method needs an objective function through which it can collect an error score for each candidate hyperparameter set. Every time this objective function runs, it fits XGBoost with 10-fold cross-validation using the selected hyperparameters and records the resulting test error score (RMSE) for that specific set.
# Objective function for the Bayesian hyperparameter tuning method
obj.fun <- makeSingleObjectiveFunction(
  # name of the objective function
  name = "xgb_cv_bayes",
  # the cross-validated xgboost run for one hyperparameter set
  fn = function(x) {
    set.seed(seed)
    print(x)
    cv <- xgb.cv(
      params = list(
        booster = "gbtree",
        eta = x["eta"],
        max_depth = x["max_depth"],
        min_child_weight = x["min_child_weight"],
        gamma = x["gamma"],
        lambda = x["lambda"],
        alpha = x["alpha"],
        subsample = x["subsample"],
        colsample_bytree = x["colsample_bytree"],
        max_delta_step = x["max_delta_step"],
        tweedie_variance_power = x["tweedie_variance_power"],
        objective = 'reg:tweedie',
        eval_metric = 'rmse'
      ),
      data = BoostTrainData,
      nrounds = 7000,
      nthread = cores,
      nfold = 10,
      prediction = FALSE,
      showsd = TRUE,
      early_stopping_rounds = 5,
      verbose = 1,
      print_every_n = 500
    )
    # return the minimum mean test RMSE across boosting rounds
    cv$evaluation_log %>% pull(4) %>% min
  },
  # hyperparameter search space
  par.set = makeParamSet(
    makeNumericParam("eta", lower = 0.005, upper = 0.36),
    makeNumericParam("gamma", lower = 1, upper = 8),
    makeNumericParam("lambda", lower = 1, upper = 8),
    makeNumericParam("alpha", lower = 1, upper = 8),
    makeIntegerParam("max_depth", lower = 2, upper = 20),
    makeIntegerParam("min_child_weight", lower = 1, upper = 2000),
    makeNumericParam("subsample", lower = 0.01, upper = 1),
    makeNumericParam("colsample_bytree", lower = 0.01, upper = 1),
    makeNumericParam("max_delta_step", lower = 0, upper = 10),
    makeNumericParam("tweedie_variance_power", lower = 1, upper = 2)
  ),
  # objective direction (minimizing test RMSE)
  minimize = TRUE
)
The driver function for the Bayesian method models the test RMSE as a function of the hyperparameter sets tried so far. It then iterates, running XGBoost with further hyperparameter sets, to refine the surrogate model and minimize the objective (test RMSE). Finally, it returns the best performing hyperparameter set.
# Driver function
do_bayes <- function(n_design = NULL,
                     opt_steps = NULL,
                     of = obj.fun,
                     seed = seed) {
  set.seed(seed)
  # initial design: a Latin hypercube sample over the search space
  des <- generateDesign(n = n_design,
                        par.set = getParamSet(of),
                        fun = lhs::randomLHS)
  control <- makeMBOControl() %>%
    setMBOControlTermination(iters = opt_steps)
  # model RMSE as a function of the hyperparameters (the actual driver)
  run <- mbo(
    fun = of,
    design = des,
    learner = makeLearner(
      "regr.km",
      predict.type = "se",
      covtype = "matern3_2",
      control = list(trace = FALSE)
    ),
    control = control,
    show.info = TRUE
  )
  # plot the optimization trace
  opt_plot <- run$opt.path$env$path %>%
    mutate(Round = row_number()) %>%
    mutate(type = case_when(Round <= n_design ~ "Design",
                            TRUE ~ "mlrMBO optimization")) %>%
    ggplot(aes(x = Round, y = y, color = type)) +
    geom_point() +
    labs(title = "mlrMBO optimization") +
    ylab("test RMSE")
  return(list(run = run, plot = opt_plot))
}
Enough talking; the following code runs the Bayesian method to tune the XGBoost model. After tuning finishes, the best performing hyperparameter set will be used to generate the final model for submission.
# Let's go!!!
# 15 initial design runs build the surrogate model, followed by another
# 5 optimization runs; the best hyperparameter set is then returned
# (increase these numbers to yield better results)
runs <- do_bayes(
  n_design = 15, # 500
  of = obj.fun,
  opt_steps = 5, # 1000
  seed = seed
)
## Computing y column(s) for design. Not provided.
## eta gamma lambda
## 0.1433897 2.1655127 3.0476909
## alpha max_depth min_child_weight
## 3.4483429 18.0000000 1252.0000000
## subsample colsample_bytree max_delta_step
## 0.5097477 0.4863149 4.9631787
## tweedie_variance_power
## 1.4811994
## [1] train-rmse:175.849486+0.346263 test-rmse:175.822342+3.113339
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [80] train-rmse:38.942908+0.298351 test-rmse:47.661586+0.920629
##
## eta gamma lambda
## 0.34822109 7.69848880 5.96999307
## alpha max_depth min_child_weight
## 4.90819985 3.00000000 605.00000000
## subsample colsample_bytree max_delta_step
## 0.05886207 0.68459279 5.60849741
## tweedie_variance_power
## 1.81771245
## [1] train-rmse:175.778471+0.346324 test-rmse:175.751070+3.113864
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [130] train-rmse:57.547262+0.699675 test-rmse:57.725537+2.503902
##
## eta gamma lambda
## 1.990936e-02 1.552453e+00 7.936367e+00
## alpha max_depth min_child_weight
## 6.481732e+00 6.000000e+00 1.950000e+03
## subsample colsample_bytree max_delta_step
## 3.645677e-01 3.089692e-01 4.583122e+00
## tweedie_variance_power
## 1.572117e+00
## [1] train-rmse:175.969369+0.346170 test-rmse:175.942174+3.112450
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## [501] train-rmse:49.610105+0.111876 test-rmse:50.899946+1.257388
## [1001] train-rmse:46.847943+0.104132 test-rmse:49.344957+0.970688
## Stopping. Best iteration:
## [1059] train-rmse:46.634659+0.106617 test-rmse:49.258266+0.969962
##
## eta gamma lambda
## 0.07220784 7.21247581 4.59720215
## alpha max_depth min_child_weight
## 1.12367498 5.00000000 65.00000000
## subsample colsample_bytree max_delta_step
## 0.29149216 0.44326297 2.60879215
## tweedie_variance_power
## 1.72679163
## [1] train-rmse:175.942735+0.346202 test-rmse:175.915547+3.112651
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [465] train-rmse:47.670128+0.098051 test-rmse:49.052876+1.013779
##
## eta gamma lambda
## 0.2718219 2.5500253 5.6599657
## alpha max_depth min_child_weight
## 5.9559969 14.0000000 729.0000000
## subsample colsample_bytree max_delta_step
## 0.7628510 0.9775092 1.4348977
## tweedie_variance_power
## 1.7715245
## [1] train-rmse:175.819989+0.346289 test-rmse:175.792921+3.113554
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [57] train-rmse:44.483513+0.144819 test-rmse:47.899496+0.863629
##
## eta gamma lambda
## 0.2236498 6.5824342 5.0139707
## alpha max_depth min_child_weight
## 1.7070139 8.0000000 518.0000000
## subsample colsample_bytree max_delta_step
## 0.1660301 0.2270238 0.7804302
## tweedie_variance_power
## 1.8781448
## [1] train-rmse:175.908124+0.346218 test-rmse:175.881842+3.112901
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [137] train-rmse:55.744936+0.442474 test-rmse:55.977466+1.493213
##
## eta gamma lambda
## 0.1584083 6.7180503 7.1675856
## alpha max_depth min_child_weight
## 2.7147090 19.0000000 227.0000000
## subsample colsample_bytree max_delta_step
## 0.5612247 0.7629903 3.2139675
## tweedie_variance_power
## 1.3945905
## [1] train-rmse:175.793785+0.346342 test-rmse:175.766392+3.113746
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [55] train-rmse:28.004347+0.132317 test-rmse:47.451966+1.311953
##
## eta gamma lambda
## 0.2010198 4.7151822 6.9399174
## alpha max_depth min_child_weight
## 7.8536417 16.0000000 987.0000000
## subsample colsample_bytree max_delta_step
## 0.9655378 0.9049767 9.7988411
## tweedie_variance_power
## 1.4610456
## [1] train-rmse:175.773486+0.346340 test-rmse:175.745911+3.113904
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [43] train-rmse:34.270440+0.184371 test-rmse:46.353579+0.997256
##
## eta gamma lambda
## 0.3273010 5.3999300 6.3863042
## alpha max_depth min_child_weight
## 3.8272900 2.0000000 1387.0000000
## subsample colsample_bytree max_delta_step
## 0.8150512 0.1969072 6.7685915
## tweedie_variance_power
## 1.3202279
## [1] train-rmse:175.306444+0.346752 test-rmse:175.278867+3.117355
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [177] train-rmse:48.534345+0.733089 test-rmse:50.266158+1.156792
##
## eta gamma lambda
## 0.3019416 4.1090816 3.4134593
## alpha max_depth min_child_weight
## 2.2287592 17.0000000 364.0000000
## subsample colsample_bytree max_delta_step
## 0.6303300 0.6679775 0.6449322
## tweedie_variance_power
## 1.0484468
## [1] train-rmse:175.899840+0.346229 test-rmse:175.872408+3.112977
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [45] train-rmse:17.006546+0.619475 test-rmse:52.741373+1.020810
##
## eta gamma lambda
## 0.03475413 5.77016770 2.21008797
## alpha max_depth min_child_weight
## 5.32484333 15.00000000 905.00000000
## subsample colsample_bytree max_delta_step
## 0.68109201 0.03669035 8.31438419
## tweedie_variance_power
## 1.99120712
## [1] train-rmse:175.969275+0.346163 test-rmse:175.942012+3.112453
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## [501] train-rmse:66.110576+1.265927 test-rmse:66.360620+3.592282
## [1001] train-rmse:58.655420+0.358635 test-rmse:58.941298+1.882641
## Stopping. Best iteration:
## [1153] train-rmse:58.262771+0.213446 test-rmse:58.538830+1.784587
##
## eta gamma lambda
## 9.790842e-02 3.744923e+00 4.114621e+00
## alpha max_depth min_child_weight
## 7.043411e+00 1.200000e+01 1.144000e+03
## subsample colsample_bytree max_delta_step
## 9.819786e-02 9.892343e-02 6.013350e+00
## tweedie_variance_power
## 1.166854e+00
## [1] train-rmse:175.682570+0.346422 test-rmse:175.655051+3.114572
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [164] train-rmse:46.865919+0.539149 test-rmse:54.303232+1.403258
##
## eta gamma lambda
## 0.1912814 1.0029404 2.4336267
## alpha max_depth min_child_weight
## 4.3432986 10.0000000 1832.0000000
## subsample colsample_bytree max_delta_step
## 0.2721438 0.3707872 8.7162563
## tweedie_variance_power
## 1.1099212
## [1] train-rmse:174.330325+0.347725 test-rmse:174.302690+3.124370
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [32] train-rmse:43.197246+0.735794 test-rmse:51.808310+1.724017
##
## eta gamma lambda
## 0.1176200 2.9306495 1.6517395
## alpha max_depth min_child_weight
## 2.9716598 10.0000000 1653.0000000
## subsample colsample_bytree max_delta_step
## 0.4102853 0.5529085 7.3697527
## tweedie_variance_power
## 1.6446855
## [1] train-rmse:175.905284+0.346218 test-rmse:175.878613+3.112913
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [233] train-rmse:44.555446+0.196047 test-rmse:48.913560+0.908900
##
## eta gamma lambda
## 0.2487132 4.7803367 1.0335552
## alpha max_depth min_child_weight
## 7.4775389 8.0000000 1576.0000000
## subsample colsample_bytree max_delta_step
## 0.9328213 0.8088376 3.3993612
## tweedie_variance_power
## 1.2532554
## [1] train-rmse:175.467448+0.346584 test-rmse:175.440715+3.116189
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [32] train-rmse:36.999666+0.303130 test-rmse:46.991657+1.150315
## [mbo] 0: eta=0.143; gamma=2.17; lambda=3.05; alpha=3.45; max_depth=18; min_child_weight=1252; subsample=0.51; colsample_bytree=0.486; max_delta_step=4.96; tweedie_variance_power=1.48 : y = 47.7 : 29.5 secs : initdesign
## [mbo] 0: eta=0.348; gamma=7.7; lambda=5.97; alpha=4.91; max_depth=3; min_child_weight=605; subsample=0.0589; colsample_bytree=0.685; max_delta_step=5.61; tweedie_variance_power=1.82 : y = 57.7 : 10.7 secs : initdesign
## [mbo] 0: eta=0.0199; gamma=1.55; lambda=7.94; alpha=6.48; max_depth=6; min_child_weight=1950; subsample=0.365; colsample_bytree=0.309; max_delta_step=4.58; tweedie_variance_power=1.57 : y = 49.3 : 137.8 secs : initdesign
## [mbo] 0: eta=0.0722; gamma=7.21; lambda=4.6; alpha=1.12; max_depth=5; min_child_weight=65; subsample=0.291; colsample_bytree=0.443; max_delta_step=2.61; tweedie_variance_power=1.73 : y = 49.1 : 56.9 secs : initdesign
## [mbo] 0: eta=0.272; gamma=2.55; lambda=5.66; alpha=5.96; max_depth=14; min_child_weight=729; subsample=0.763; colsample_bytree=0.978; max_delta_step=1.43; tweedie_variance_power=1.77 : y = 47.9 : 24.5 secs : initdesign
## [mbo] 0: eta=0.224; gamma=6.58; lambda=5.01; alpha=1.71; max_depth=8; min_child_weight=518; subsample=0.166; colsample_bytree=0.227; max_delta_step=0.78; tweedie_variance_power=1.88 : y = 56 : 15.5 secs : initdesign
## [mbo] 0: eta=0.158; gamma=6.72; lambda=7.17; alpha=2.71; max_depth=19; min_child_weight=227; subsample=0.561; colsample_bytree=0.763; max_delta_step=3.21; tweedie_variance_power=1.39 : y = 47.5 : 37.0 secs : initdesign
## [mbo] 0: eta=0.201; gamma=4.72; lambda=6.94; alpha=7.85; max_depth=16; min_child_weight=987; subsample=0.966; colsample_bytree=0.905; max_delta_step=9.8; tweedie_variance_power=1.46 : y = 46.4 : 24.7 secs : initdesign
## [mbo] 0: eta=0.327; gamma=5.4; lambda=6.39; alpha=3.83; max_depth=2; min_child_weight=1387; subsample=0.815; colsample_bytree=0.197; max_delta_step=6.77; tweedie_variance_power=1.32 : y = 50.3 : 12.1 secs : initdesign
## [mbo] 0: eta=0.302; gamma=4.11; lambda=3.41; alpha=2.23; max_depth=17; min_child_weight=364; subsample=0.63; colsample_bytree=0.668; max_delta_step=0.645; tweedie_variance_power=1.05 : y = 52.7 : 18.9 secs : initdesign
## [mbo] 0: eta=0.0348; gamma=5.77; lambda=2.21; alpha=5.32; max_depth=15; min_child_weight=905; subsample=0.681; colsample_bytree=0.0367; max_delta_step=8.31; tweedie_variance_power=1.99 : y = 58.5 : 112.8 secs : initdesign
## [mbo] 0: eta=0.0979; gamma=3.74; lambda=4.11; alpha=7.04; max_depth=12; min_child_weight=1144; subsample=0.0982; colsample_bytree=0.0989; max_delta_step=6.01; tweedie_variance_power=1.17 : y = 54.3 : 27.4 secs : initdesign
## [mbo] 0: eta=0.191; gamma=1; lambda=2.43; alpha=4.34; max_depth=10; min_child_weight=1832; subsample=0.272; colsample_bytree=0.371; max_delta_step=8.72; tweedie_variance_power=1.11 : y = 51.8 : 8.2 secs : initdesign
## [mbo] 0: eta=0.118; gamma=2.93; lambda=1.65; alpha=2.97; max_depth=10; min_child_weight=1653; subsample=0.41; colsample_bytree=0.553; max_delta_step=7.37; tweedie_variance_power=1.64 : y = 48.9 : 53.7 secs : initdesign
## [mbo] 0: eta=0.249; gamma=4.78; lambda=1.03; alpha=7.48; max_depth=8; min_child_weight=1576; subsample=0.933; colsample_bytree=0.809; max_delta_step=3.4; tweedie_variance_power=1.25 : y = 47 : 10.4 secs : initdesign
## eta gamma lambda
## 0.1798297 2.9852834 5.7587270
## alpha max_depth min_child_weight
## 6.5558814 10.0000000 1206.0000000
## subsample colsample_bytree max_delta_step
## 0.7898417 0.9367973 3.3939614
## tweedie_variance_power
## 1.4980765
## [1] train-rmse:175.815965+0.346299 test-rmse:175.788626+3.113577
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [61] train-rmse:38.718709+0.254796 test-rmse:46.401449+1.123193
## [mbo] 1: eta=0.18; gamma=2.99; lambda=5.76; alpha=6.56; max_depth=10; min_child_weight=1206; subsample=0.79; colsample_bytree=0.937; max_delta_step=3.39; tweedie_variance_power=1.5 : y = 46.4 : 22.6 secs : infill_cb
## eta gamma lambda
## 0.1541928 4.3861109 5.5257429
## alpha max_depth min_child_weight
## 5.0235506 19.0000000 1264.0000000
## subsample colsample_bytree max_delta_step
## 0.9086676 0.7783669 4.1549332
## tweedie_variance_power
## 1.3475073
## [1] train-rmse:175.769217+0.346364 test-rmse:175.742021+3.113918
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [53] train-rmse:28.683528+0.132036 test-rmse:46.496345+1.277794
## [mbo] 2: eta=0.154; gamma=4.39; lambda=5.53; alpha=5.02; max_depth=19; min_child_weight=1264; subsample=0.909; colsample_bytree=0.778; max_delta_step=4.15; tweedie_variance_power=1.35 : y = 46.5 : 33.2 secs : infill_cb
## eta gamma lambda
## 0.1846294 3.8622876 2.0013141
## alpha max_depth min_child_weight
## 7.7323198 17.0000000 539.0000000
## subsample colsample_bytree max_delta_step
## 0.9134028 0.8936029 4.5158724
## tweedie_variance_power
## 1.3751894
## [1] train-rmse:175.739752+0.346364 test-rmse:175.712285+3.114134
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [45] train-rmse:23.743065+0.272348 test-rmse:46.974959+1.235782
## [mbo] 3: eta=0.185; gamma=3.86; lambda=2; alpha=7.73; max_depth=17; min_child_weight=539; subsample=0.913; colsample_bytree=0.894; max_delta_step=4.52; tweedie_variance_power=1.38 : y = 47 : 31.5 secs : infill_cb
## eta gamma lambda
## 0.1235886 3.0338489 6.2845442
## alpha max_depth min_child_weight
## 3.1953927 19.0000000 708.0000000
## subsample colsample_bytree max_delta_step
## 0.9193236 0.8764768 2.7139873
## tweedie_variance_power
## 1.5308240
## [1] train-rmse:175.882330+0.346252 test-rmse:175.854909+3.113097
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [88] train-rmse:32.019518+0.194947 test-rmse:45.989405+0.970723
## [mbo] 4: eta=0.124; gamma=3.03; lambda=6.28; alpha=3.2; max_depth=19; min_child_weight=708; subsample=0.919; colsample_bytree=0.876; max_delta_step=2.71; tweedie_variance_power=1.53 : y = 46 : 52.0 secs : infill_cb
## eta gamma lambda
## 0.1394113 3.1417033 7.9378679
## alpha max_depth min_child_weight
## 6.7552343 8.0000000 431.0000000
## subsample colsample_bytree max_delta_step
## 0.9315066 0.8207677 4.1949087
## tweedie_variance_power
## 1.6853234
## [1] train-rmse:175.896364+0.346238 test-rmse:175.868819+3.112994
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [143] train-rmse:40.871966+0.107210 test-rmse:46.315541+0.825370
## [mbo] 5: eta=0.139; gamma=3.14; lambda=7.94; alpha=6.76; max_depth=8; min_child_weight=431; subsample=0.932; colsample_bytree=0.821; max_delta_step=4.19; tweedie_variance_power=1.69 : y = 46.3 : 38.4 secs : infill_cb
plot(runs$run)
## Loading required package: grid
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
best.params <- runs$run$x
# re-run cross-validation with the best hyperparameter set
set.seed(seed)
optimal.cv <- xgb.cv(
params = best.params,
data = BoostTrainData,
nrounds = 7000,
nthread = cores,
nfold = 10,
prediction = FALSE,
showsd = TRUE,
early_stopping_rounds = 5,
verbose = 1,
print_every_n = 100,
objective = 'reg:tweedie',
eval_metric = 'rmse'
)
## [1] train-rmse:175.882329+0.346253 test-rmse:175.854912+3.113096
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [88] train-rmse:32.019517+0.194948 test-rmse:45.989404+0.970722
# make the final model
set.seed(seed)
model <-
xgboost(
params = best.params,
data = BoostTrainData,
nrounds = optimal.cv$best_ntreelimit
)
## [21:22:39] WARNING: amalgamation/../src/learner.cc:573:
## Parameters: { "tweedie_variance_power" } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
##
##
## [1] train-rmse:175.723572
## [2] train-rmse:175.462982
## ... [iterations 3 through 87 omitted; train-rmse decreases steadily from 175.20 to 154.45] ...
## [88] train-rmse:154.216431
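The `tweedie_variance_power` warning above hints that the final `xgboost()` call fell back to the default squared-error objective: `objective = 'reg:tweedie'` was passed to `xgb.cv` but not to `xgboost()`, which would also explain why the logged train-rmse values differ so much from the cross-validated ones. A sketch of a call that restates the objective, assuming `best.params`, `BoostTrainData`, and `optimal.cv` as defined above (untested against this data):

```r
# restate the Tweedie objective so the final fit matches the xgb.cv setup
# and tweedie_variance_power is actually used
set.seed(seed)
model <- xgboost(
  params = c(best.params, objective = "reg:tweedie", eval_metric = "rmse"),
  data = BoostTrainData,
  nrounds = optimal.cv$best_ntreelimit,
  verbose = 0
)
```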
# take a peek
summary(model)
## Length Class Mode
## handle 1 xgb.Booster.handle externalptr
## raw 40092 -none- raw
## niter 1 -none- numeric
## evaluation_log 2 data.table list
## call 13 -none- call
## params 11 -none- list
## callbacks 2 -none- list
## feature_names 31 -none- character
## nfeatures 1 -none- numeric
# predict on the test set; the dummy price = -1 column only lets model.matrix
# apply the same training formula, it is dropped as the response
pred <-
predict(model, model.matrix(price~., testData %>% mutate(price = -1)) %>% xgb.DMatrix())
# make submission csv
# writeSubmit(pred)
The neural network below, unfortunately, turns out to be artificial stupidity rather than artificial intelligence. (Such a shame.)
NetTrainX <- trainData %>% select(-price) %>% as.matrix()
NetTrainY <- trainData %>% select(price) %>% as.matrix()
inputSize <- dim(NetTrainX)[2]
offsetSize <- 0.45
dropout <- 0.5
activate <- "relu"
scaleSize <- inputSize * offsetSize
model <- keras_model_sequential() %>%
layer_dense(units = scaleSize*2, activation = activate, input_shape = dim(NetTrainX)[2]) %>%
layer_dropout(rate = dropout) %>%
layer_dense(units = scaleSize*4, activation = activate) %>%
layer_dropout(rate = dropout) %>%
layer_dense(units = scaleSize*8, activation = activate) %>%
layer_dropout(rate = dropout) %>%
layer_dense(units = scaleSize*4, activation = activate) %>%
layer_dropout(rate = dropout) %>%
layer_dense(units = scaleSize*2, activation = activate) %>%
layer_dropout(rate = dropout) %>%
layer_dense(units = scaleSize*0.5, activation = activate) %>%
layer_dense(units = scaleSize*0.25, activation = activate) %>%
layer_dense(units = 1, activation = activate) # note: a "linear" activation is the usual choice for a regression output; relu clips negative pre-activations to zero
model %>% compile(
loss = "mse",
optimizer = "nadam",
metrics = c("mape","mse")
)
model %>% summary()
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_7 (Dense) (None, 27) 837
## ________________________________________________________________________________
## dropout_4 (Dropout) (None, 27) 0
## ________________________________________________________________________________
## dense_6 (Dense) (None, 54) 1512
## ________________________________________________________________________________
## dropout_3 (Dropout) (None, 54) 0
## ________________________________________________________________________________
## dense_5 (Dense) (None, 108) 5940
## ________________________________________________________________________________
## dropout_2 (Dropout) (None, 108) 0
## ________________________________________________________________________________
## dense_4 (Dense) (None, 54) 5886
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 54) 0
## ________________________________________________________________________________
## dense_3 (Dense) (None, 27) 1485
## ________________________________________________________________________________
## dropout (Dropout) (None, 27) 0
## ________________________________________________________________________________
## dense_2 (Dense) (None, 6) 168
## ________________________________________________________________________________
## dense_1 (Dense) (None, 3) 21
## ________________________________________________________________________________
## dense (Dense) (None, 1) 4
## ================================================================================
## Total params: 15,853
## Trainable params: 15,853
## Non-trainable params: 0
## ________________________________________________________________________________
set.seed(seed)
model %>% fit(
NetTrainX,
NetTrainY,
epochs = 20, # change to make it think longer
batch_size = 128,
validation_split = 0.2,
verbose = 2
)
scores <- model %>% evaluate(NetTrainX, NetTrainY, verbose = 0)
print(scores)
## loss mape mse
## 17233.57 9187553.00 17233.57
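The evaluation above (MSE ≈ 17,234, i.e. RMSE ≈ 131, and an enormous MAPE) suggests the network barely learned. Two likely culprits are the unstandardized inputs and the ReLU activation on the output layer. A sketch of both fixes, assuming `NetTrainX` and `NetTrainY` as built above (the layer sizes here are illustrative, not tuned):

```r
# standardize each feature so the dense layers see comparable input scales
NetTrainX.scaled <- scale(NetTrainX)

model2 <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = ncol(NetTrainX.scaled)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "linear") # linear output for regression

model2 %>% compile(loss = "mse", optimizer = "nadam", metrics = "mse")
```

Any test-set predictions would then need the same scaling (the centers and scales stored in `attributes(NetTrainX.scaled)`) applied to the test matrix.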
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_7 (Dense) (None, 27) 837
## ________________________________________________________________________________
## dropout_4 (Dropout) (None, 27) 0
## ________________________________________________________________________________
## dense_6 (Dense) (None, 54) 1512
## ________________________________________________________________________________
## dropout_3 (Dropout) (None, 54) 0
## ________________________________________________________________________________
## dense_5 (Dense) (None, 108) 5940
## ________________________________________________________________________________
## dropout_2 (Dropout) (None, 108) 0
## ________________________________________________________________________________
## dense_4 (Dense) (None, 54) 5886
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 54) 0
## ________________________________________________________________________________
## dense_3 (Dense) (None, 27) 1485
## ________________________________________________________________________________
## dropout (Dropout) (None, 27) 0
## ________________________________________________________________________________
## dense_2 (Dense) (None, 6) 168
## ________________________________________________________________________________
## dense_1 (Dense) (None, 3) 21
## ________________________________________________________________________________
## dense (Dense) (None, 1) 4
## ================================================================================
## Total params: 15,853
## Trainable params: 15,853
## Non-trainable params: 0
## ________________________________________________________________________________
pred <- predict(model, testData %>% as.matrix())
# writeSubmit(pred)
In conclusion, there is still much room for improvement, such as better outlier handling, better NA imputation, and a better clustering method. Time permitting, it might also be worth manually implementing an ensemble that blends XGBoost with a random forest (just a thought; it could take a long time to get right). Beyond that, the current model (the one that produces the best RMSE on Kaggle) takes too long to run (about 3 days), and most of that time is spent on feature selection, finding the optimal number of clusters, and tuning XGBoost.
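The blending idea can be sketched in a few lines. This is purely illustrative: it assumes the `ranger` package is available, that a hypothetical `xgb.model` holds the booster fit in the XGBoost section, and the 0.7/0.3 weights are placeholders that would need tuning (e.g. by cross-validation):

```r
library(ranger)

# random forest leg of the ensemble
set.seed(seed)
rf <- ranger(price ~ ., data = trainData, num.trees = 500)
rf.pred <- predict(rf, data = testData)$predictions

# xgboost leg (xgb.model is assumed to be the booster fit earlier)
xgb.pred <- predict(
  xgb.model,
  model.matrix(price ~ ., testData %>% mutate(price = -1)) %>% xgb.DMatrix()
)

# simple weighted blend; the weights are illustrative, not tuned values
blend <- 0.7 * xgb.pred + 0.3 * rf.pred
```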